Many features have been proposed for speech-based emotion recognition, and a majority of them are frame based or statistics estimated from frame-based features. Temporal information is typically modelled on a per-utterance basis, with either functionals of frame-based features or a suitable back-end. This paper investigates an approach that combines both, using temporal contours of parameters extracted from a three-component model of speech production as features in an automatic emotion recognition system with a hidden Markov model (HMM)-based back-end. Consequently, the proposed system models information on a segment-by-segment basis, at a scale larger than frame-based analysis but smaller than utterance-level modelling. Specifically, linear approximations to temporal contours of formant frequencies, glottal parameters and pitch are used to model short-term temporal information over individual segments of voiced speech. This is followed by the use of HMMs to model longer-term temporal information contained in sequences of voiced segments. Listening tests were conducted to validate the use of linear approximations in this context. Automatic emotion classification experiments were carried out on the Linguistic Data Consortium emotional prosody speech and transcripts corpus and the FAU Aibo corpus to validate the proposed approach.
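To make the per-segment feature extraction concrete, the following is a minimal sketch, not the authors' implementation, of how a linear approximation to one parameter contour (for example pitch) over a single voiced segment could be computed; the function name, frame rate, and example values are illustrative assumptions only.

```python
import numpy as np

def linear_contour_features(contour, frame_rate=100.0):
    """Fit a straight line to a per-frame parameter contour (e.g. pitch or a
    formant frequency) over one voiced segment and return (slope, intercept).

    `contour` is a 1-D array of per-frame values; `frame_rate` (frames per
    second) is an assumed value used here purely for illustration.
    """
    t = np.arange(len(contour)) / frame_rate            # frame times in seconds
    slope, intercept = np.polyfit(t, contour, deg=1)    # least-squares line fit
    return slope, intercept

# Example: a rising pitch contour over a short voiced segment
pitch_hz = np.array([118.0, 121.5, 124.0, 127.2, 130.1, 133.0])
print(linear_contour_features(pitch_hz))  # positive slope indicates rising pitch
```

In such a scheme, the slope and intercept from each contour would form part of a per-segment feature vector, and the sequence of these vectors across consecutive voiced segments would then be modelled by the HMM-based back-end.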